Add F16 precision toolkit (AVX2) + ARM NEON specialist agent #91
Merged
simd_avx2.rs — 3 precision tricks, all AVX2-accelerated (additive only):
Trick 1: Double-f16 (Error-Free Split)
f16_double_encode/decode: store value as hi+lo f16 pair
~20-bit effective precision (vs 10-bit single f16)
f16_double_encode/decode_batch: AVX2 F16C + f32x8 addition
Error: ≤2^{-21} × |value| (vs ≤2^{-11} for single f16)
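The hi+lo pair is the classic double-word (error-free) split: the low word stores exactly the rounding error of the high word. A minimal scalar sketch, using f64 → f32 narrowing as a stand-in for the f32 → f16 (F16C) narrowing in simd_avx2.rs — the algebra is identical, and the function names here are illustrative, not the crate's actual API:

```rust
/// Encode `x` as a (hi, lo) pair in the narrower format.
/// `hi` is the rounded value; `lo` captures the rounding residual.
fn double_encode(x: f64) -> (f32, f32) {
    let hi = x as f32;                // first rounding
    let lo = (x - hi as f64) as f32;  // residual of that rounding
    (hi, lo)
}

/// Decode by summing the pair in the wide format.
fn double_decode(hi: f32, lo: f32) -> f64 {
    hi as f64 + lo as f64
}

fn main() {
    let x = 0.1_f64; // not exactly representable in the narrow format
    let (hi, lo) = double_encode(x);
    let single_err = (x - hi as f64).abs();
    let double_err = (x - double_decode(hi, lo)).abs();
    println!("single err: {:e}, double err: {:e}", single_err, double_err);
    // The pair recovers far more precision than the single narrow value.
    assert!(double_err < single_err / 1000.0);
}
```

The batch variants presumably do the same per f32x8 lane, with the F16C convert supplying the rounding step.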
Trick 2: Kahan-compensated accumulation
f16_kahan_sum: O(ε) error instead of O(N·ε) — independent of count
f16_kahan_dot: AVX2 f32x8 multiply + Kahan-accumulate partial sums
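Kahan's recurrence keeps a running compensation term that recaptures the low-order bits each addition would otherwise discard. A scalar f32 sketch of the idea (the AVX2 version applies the same recurrence per f32x8 lane; names are illustrative, not the crate's API):

```rust
/// Kahan-compensated summation: error stays O(ε) regardless of length.
fn kahan_sum(xs: &[f32]) -> f32 {
    let (mut sum, mut c) = (0.0_f32, 0.0_f32);
    for &x in xs {
        let y = x - c;      // apply the running compensation
        let t = sum + y;    // low-order bits of y may be lost here...
        c = (t - sum) - y;  // ...but are recovered into c
        sum = t;
    }
    sum
}

fn main() {
    // One million small addends: naive f32 summation drifts badly once
    // `sum` dwarfs each addend; Kahan stays near the exact 1000.0.
    let xs = vec![0.001_f32; 1_000_000];
    let naive: f32 = xs.iter().sum();
    let kahan = kahan_sum(&xs);
    println!("naive = {naive}, kahan = {kahan}");
    assert!((kahan - 1000.0).abs() < (naive - 1000.0).abs());
    assert!((kahan - 1000.0).abs() < 1e-3);
}
```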
Trick 3: Exponent-aligned scaling (F16Scaler)
from_range/from_data: auto-compute scale factor for value range
encode/decode_batch: AVX2 f32x8 scale + F16C convert
Up to ~128× precision improvement for narrow-range data
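The scaling trick multiplies by a power of two before narrowing, which is exact (only the exponent moves) and lifts small values out of the subnormal region where mantissa bits are lost. A scalar sketch of the same idea, again using f64 → f32 narrowing as a stand-in for f32 → f16; `scale_for` is a hypothetical helper, not the F16Scaler API:

```rust
/// Hypothetical helper: pick a power-of-two scale mapping the data's
/// magnitude near 1.0, where the narrow format has full precision.
fn scale_for(max_abs: f64) -> f64 {
    (1.0 / max_abs).log2().floor().exp2()
}

fn main() {
    // Narrow-range data deep in f32's subnormal region: narrowing
    // directly keeps only a handful of mantissa bits.
    let xs = [3.0e-41_f64, 3.1e-41, 3.2e-41];
    let scale = scale_for(3.2e-41);

    for &x in &xs {
        let direct = (x as f32) as f64;
        let scaled = ((x * scale) as f32) as f64 / scale;
        let (e_direct, e_scaled) = ((x - direct).abs(), (x - scaled).abs());
        println!("direct err {:e}, scaled err {:e}", e_direct, e_scaled);
        // Power-of-two scaling loses nothing; the round trip is tighter.
        assert!(e_scaled <= e_direct);
    }
}
```

Decode multiplies by the inverse scale, so encode/decode stay exact inverses apart from the single narrowing round.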
⚠️ NOT FOR GGUF CALIBRATION — BF16 pipeline is separate
.claude/agents/arm-neon-specialist.md:
Complete ARM SBC knowledge: Pi Zero 2W through Pi 5, Orange Pi 3-5
Per-CPU microarchitecture (A53/A72/A76 pipeline differences)
big.LITTLE awareness (RK3399, RK3588)
F16 inline asm trick, codebook strategy per tier, memory budgets
6 new tests passing. No existing code modified.
https://claude.ai/code/session_017ZN5PNEf8boFBgorUZVrFU